

Tcl Regular Expressions – Greedy or Non-greedy?

I came across an interesting problem while writing a small regular expression which would return a multi-line match from an ssh_config file.

Using the following ssh_config as an example:

Host polecat
HostName polecat.qcode.co.uk
User gerry
IdentityFile ~/.ssh/id_gerry_rsa

Host stoat
HostName stoat.qcode.co.uk
User fergus
IdentityFile ~/.ssh/id_fergus_rsa

 Host ferret
 HostName ferret.qcode.co.uk
 User farquar
 IdentityFile ~/.ssh/id_farquar_rsa

Host weasel
HostName weasel.qcode.co.uk
User crawford
IdentityFile ~/.ssh/id_crawford_rsa
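
The commands later on operate on a variable named config holding the whole file as one string. A minimal sketch of loading it, assuming a hypothetical helper named read_config and a file path of your own choosing (neither is given in the original):

```tcl
# Hypothetical helper: return the full contents of a file as a single string.
proc read_config {path} {
    set fh [open $path r]
    # Read everything at once; the regular expressions below treat the
    # file as one string with embedded newlines.
    set data [read $fh]
    close $fh
    return $data
}

# Example call (the path is an assumption):
#   set config [read_config [file join $::env(HOME) .ssh config]]
```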

The requirement was to match an entire host clause. The match should start at the Host keyword
and end immediately before the next Host clause, without including that next Host string in the match.

Firstly, we have to make sure we do not activate newline-sensitive matching (regexp -line),
because our match will span multiple lines. Without -line, ^ matches only at the beginning of
the string, $ matches only at the end, and . matches any character, including a newline.
Newlines, \n, are just part of the string.
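
The difference can be seen on a two-line string. This is a small sketch with a made-up sample string, not the ssh_config itself:

```tcl
set text "first line\nsecond line"

# Default: . also matches \n, and ^ and $ anchor only at the ends of the string.
set whole [lindex [regexp -inline {^.*$} $text] 0]

# With -line: ^ and $ also anchor at line boundaries, and . stops at a newline.
set single [lindex [regexp -inline -line {^.*$} $text] 0]

puts "default: [string length $whole] characters"
puts "-line:   [string length $single] characters"
```

Without -line the match covers both lines; with -line it covers only the first.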

To match the point immediately before the next Host clause, we take advantage of the very
useful positive look-ahead constraint (?=re), which matches at any point where a substring
matching re begins, without consuming that substring. So to end a match either just before a
newline, optional whitespace and Host, or at the end of the string, we can use
(?:(?=\n\s*Host\s)|$) (the (?: ... ) form is used so that this grouping is not captured as a submatch).
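
A short sketch of the constraint in isolation, using a made-up two-line string (alpha and beta are placeholder words, not part of the config):

```tcl
set text "alpha\nHost beta"

# The look-ahead requires "\nHost " to follow alpha, but leaves it out of the match.
set before [lindex [regexp -inline {alpha(?=\n\s*Host\s)} $text] 0]

# No Host clause follows beta, so the $ alternative ends the match instead.
set atEnd [lindex [regexp -inline {beta(?:(?=\n\s*Host\s)|$)} $text] 0]

puts [list $before $atEnd]
```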

We can step through the evolution of the regular expression, using the -inline switch
to display our match and -nocase to ignore case.

The first version to match only the first host clause was the following:

{^Host\s+polecat\s+.*(?:(?=\n\s*Host\s)|$)}

This starts a match at the start of the string, and is meant to stop before another Host clause begins.
But this will not work, since the expression is greedy and will match as much as possible, e.g.:

tclsh8.5 [~]regexp -nocase -inline {^Host\s+polecat\s+.*(?:(?=\n\s*Host\s)|$)} $config
{Host polecat
HostName polecat.qcode.co.uk
User gerry
IdentityFile ~/.ssh/id_gerry_rsa

Host stoat
HostName stoat.qcode.co.uk
User fergus
IdentityFile ~/.ssh/id_fergus_rsa

 Host ferret
 HostName ferret.qcode.co.uk
 User farquar
 IdentityFile ~/.ssh/id_farquar_rsa

Host weasel
HostName weasel.qcode.co.uk
User crawford
IdentityFile ~/.ssh/id_crawford_rsa

}

We need to use non-greedy matching to make the match as small as possible. You might
think that making the quantifier which covers the bulk of the match non-greedy, .*?,
would be sufficient. But no:

{^Host\s+polecat\s+.*?(?:(?=\n\s*Host\s)|$)}

tclsh8.5 [~]regexp -nocase -inline {^Host\s+polecat\s+.*?(?:(?=\n\s*Host\s)|$)} $config
{Host polecat
HostName polecat.qcode.co.uk
User gerry
IdentityFile ~/.ssh/id_gerry_rsa

Host stoat
HostName stoat.qcode.co.uk
User fergus
IdentityFile ~/.ssh/id_fergus_rsa

 Host ferret
 HostName ferret.qcode.co.uk
 User farquar
 IdentityFile ~/.ssh/id_farquar_rsa

Host weasel
HostName weasel.qcode.co.uk
User crawford
IdentityFile ~/.ssh/id_crawford_rsa

}

This is due to how Tcl decides whether greedy or non-greedy matching is used.

If we look at the Tcl reference section on Matching we can see:

    A branch has the same preference as the first quantified atom in it which has a preference.

So although it is the content matched by .*? which we wish to change, the preference of the whole
expression is governed by its first quantified atom, which is the \s+ in Host\s+polecat.
Therefore, to get the behaviour we want, rather counterintuitively, we need to make
the first quantifier non-greedy instead:

{^Host\s+?polecat\s+.*(?:(?=\n\s*Host\s)|$)}

tclsh8.5 [~]regexp -nocase -inline {^Host\s+?polecat\s+.*(?:(?=\n\s*Host\s)|$)} $config
{Host polecat
HostName polecat.qcode.co.uk
User gerry
IdentityFile ~/.ssh/id_gerry_rsa}
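
The three variants can be compared side by side on a cut-down input. This is a sketch only: the two-clause string and the host names a and b are made up, and the results assume the engine behaves as in the tclsh8.5 sessions above.

```tcl
set text "Host a\nx\n\nHost b\ny\n"

# All quantifiers greedy: the match runs to the end of the string.
set greedy [lindex [regexp -inline {^Host\s+a\s+.*(?:(?=\n\s*Host\s)|$)} $text] 0]

# Only .*? is non-greedy: the first quantified atom (\s+) still sets the preference.
set mixed [lindex [regexp -inline {^Host\s+a\s+.*?(?:(?=\n\s*Host\s)|$)} $text] 0]

# First quantifier non-greedy: the whole expression now prefers the shortest match.
set lazy [lindex [regexp -inline {^Host\s+?a\s+.*(?:(?=\n\s*Host\s)|$)} $text] 0]

foreach name {greedy mixed lazy} {
    puts "$name: [string length [set $name]] characters"
}
```

Only the last variant should stop before the second Host clause.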

A great way of debugging regular expressions is to use the -about switch.
For our original example we get the following output:

tclsh8.5 [~]regexp -about -nocase -inline {^Host\s+polecat\s+.*(?:(?=\n\s*Host\s)|$)} $config
0 {REG_ULOOKAHEAD REG_UNONPOSIX REG_ULOCALE}

This outputs the subexpression count (0 here) followed by the regular expression’s descriptive flags.
In this case we have REG_ULOOKAHEAD, which shows the regular expression contains a lookahead;
REG_UNONPOSIX, which shows it uses something that is not POSIX; and finally REG_ULOCALE,
which indicates a dependency on locale.

One flag NOT present is REG_USHORTEST, which would show that the regular expression is looking for the shortest match.

Checking the version with the non-greedy first quantifier, we can see the addition of REG_USHORTEST:

tclsh8.5 [~]regexp -about -nocase -inline {^Host\s+?polecat\s+.*(?:(?=\n\s*Host\s)|$)} $config
0 {REG_ULOOKAHEAD REG_UNONPOSIX REG_ULOCALE REG_USHORTEST}
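
The same check can be wrapped in a small procedure. This is a sketch; prefers_shortest is a hypothetical name, and the empty string passed after the expression is only a placeholder, since -about does not attempt a match:

```tcl
# Hypothetical helper: 1 if the engine reports REG_USHORTEST for the expression, else 0.
proc prefers_shortest {re} {
    # regexp -about returns a two-element list: the subexpression count
    # and a list of descriptive flags.
    lassign [regexp -about -- $re {}] count flags
    return [expr {"REG_USHORTEST" in $flags}]
}

puts [prefers_shortest {^Host\s+polecat\s+.*(?:(?=\n\s*Host\s)|$)}]
puts [prefers_shortest {^Host\s+?polecat\s+.*(?:(?=\n\s*Host\s)|$)}]
```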

The final thing to do with our ssh_config regular expression is to allow it to match host clauses which
do not appear at the beginning of the config string. This would seem quite straightforward, but there
is a small surprise: we lose non-greedy matching again, even though the first quantifier is still
specified as non-greedy:

{(?:^|\n|\s)Host\s+?stoat\s+.*(?:(?=\n\s*Host\s)|$)}

tclsh8.5 [~]regexp -nocase -inline {(?:^|\n|\s)Host\s+?stoat\s+.*(?:(?=\n\s*Host\s)|$)} $config
{
Host stoat
HostName stoat.qcode.co.uk
User fergus
IdentityFile ~/.ssh/id_fergus_rsa

 Host ferret
 HostName ferret.qcode.co.uk
 User farquar
 IdentityFile ~/.ssh/id_farquar_rsa

Host weasel
HostName weasel.qcode.co.uk
User crawford
IdentityFile ~/.ssh/id_crawford_rsa

}

tclsh8.5 [~]regexp -about -nocase -inline {(?:^|\n|\s)Host\s+?stoat\s+.*(?:(?=\n\s*Host\s)|$)} $config
0 {REG_ULOOKAHEAD REG_UNONPOSIX REG_ULOCALE}

This is likely to be due to another Matching rule, namely:

    An RE consisting of two or more branches connected by the | operator prefers longest match.

So how to turn this into a non-greedy match? We have no quantifier which we can change to non-greedy matching.

The answer is to ADD such a quantifier specifically to be able to do this. So using the benign quantifier {1,1},
which says allow from 1 to 1 occurrences of the preceding atom, we can apply its non-greedy form {1,1}? to the
whole regular expression and turn it into a non-greedy match. No other non-greedy specifier is needed:

{(?:(?:^|\n|\s)Host\s+stoat\s+.*(?:(?=\n\s*Host\s)|$)){1,1}?}

tclsh8.5 [~]regexp -about -nocase -inline {(?:(?:^|\n|\s)Host\s+stoat\s+.*(?:(?=\n\s*Host\s)|$)){1,1}?} $config
0 {REG_ULOOKAHEAD REG_UBOUNDS REG_UNONPOSIX REG_ULOCALE REG_USHORTEST}

tclsh8.5 [~]regexp -nocase -inline {(?:(?:^|\n|\s)Host\s+stoat\s+.*(?:(?=\n\s*Host\s)|$)){1,1}?} $config
{
Host stoat
HostName stoat.qcode.co.uk
User fergus
IdentityFile ~/.ssh/id_fergus_rsa}
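
To reuse the final expression for any host, it can be built from a template. The following is a sketch under assumptions: host_clause is a hypothetical helper, the alias is quoted so that characters such as . are matched literally, and the result is trimmed to drop the leading newline or space seen in the output above.

```tcl
# Hypothetical helper: return the clause for one host alias, or "" if it is absent.
proc host_clause {config host} {
    # Backslash-quote every character that is not a letter, digit or
    # underscore, so the alias is matched literally.
    set quoted [regsub -all {[^[:alnum:]_]} $host {\\&}]
    # Same pattern as above, with %s standing in for the alias.
    set template {(?:(?:^|\n|\s)Host\s+%s\s+.*(?:(?=\n\s*Host\s)|$)){1,1}?}
    set re [format $template $quoted]
    if {[regexp -nocase -- $re $config match]} {
        # Drop the whitespace picked up by (?:^|\n|\s).
        return [string trim $match]
    }
    return ""
}

# Made-up two-clause input standing in for the real config.
set sample "Host a\nx\n\nHost b\ny\n"
puts [host_clause $sample b]
```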

